A method for processing of a speech signal

The present invention relates to the field of speech processing, and more particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demi-syllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones. The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation based synthesis, the conservation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones. Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfil the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow the duration and pitch modifications in the recorded subunits, many concatenation based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) (E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990) model of synthesis. In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm. This algorithm assigns marks at the peaks of the signal in the voiced segments and assigns marks 10 ms apart in the unvoiced segments. The synthesis is made by a superposition of Hanning windowed segments centered at the pitch marks and extending from the previous pitch mark to the next one. The duration modification is provided by deleting or replicating some of the windowed segments. The pitch period modification, on the other hand, is provided by increasing or decreasing the superposition between windowed segments.
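The TD-PSOLA scheme described above can be illustrated with a minimal sketch. It assumes that the pitch marks are already known and uses a toy constant signal; the function names `hann`, `extract_segments` and `overlap_add` are hypothetical and are not taken from the cited publication.

```python
import math


def hann(n):
    """Periodic Hann (raised cosine) window of n samples."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]


def extract_segments(signal, marks):
    """Cut one Hann-windowed segment per interior pitch mark.

    Each segment extends from the previous mark to the next one.
    Returns (offset, samples) pairs, where offset is the (negative)
    distance from the centre mark to the first sample of the segment.
    """
    segments = []
    for prev, centre, nxt in zip(marks, marks[1:], marks[2:]):
        window = hann(nxt - prev)
        samples = [signal[prev + i] * w for i, w in enumerate(window)]
        segments.append((prev - centre, samples))
    return segments


def overlap_add(segments, centres, length):
    """Sum the segments at the requested output centre positions."""
    out = [0.0] * length
    for (offset, samples), centre in zip(segments, centres):
        for i, s in enumerate(samples):
            j = centre + offset + i
            if 0 <= j < length:
                out[j] += s
    return out


# Toy input (assumption): a constant signal with a pitch mark every 80 samples.
signal = [1.0] * 400
marks = list(range(0, 401, 80))
segments = extract_segments(signal, marks)

# Unmodified resynthesis: every segment goes back to its own mark.
same = overlap_add(segments, marks[1:-1], len(signal))

# Longer duration: each segment is used twice, spacing kept at 80 samples.
doubled = [seg for seg in segments for _ in range(2)]
longer = overlap_add(doubled, [80 * (k + 1) for k in range(len(doubled))], 720)
```

With a constant pitch period the two-period Hann windows of neighbouring marks add up to one, so the interior of `same` reproduces the input. In `longer` the segments are replicated at an unchanged spacing, which lengthens the signal without altering the pitch period.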
Despite the success achieved in many commercial TTS systems, the synthetic speech produced by using the TD-PSOLA model of synthesis can present some drawbacks, mainly under large prosodic variations, outlined as follows.

Examples of such PSOLA methods are those defined in documents EP-0363233, U.S. Pat. No. 5,479,564 and EP-0706170. A specific example is also the MBR-PSOLA method as published by T. Dutoit and H. Leich, in Speech Communication, Elsevier Publisher, November 1993. U.S. Pat. No. 5,479,564 suggests a means of modifying the frequency of an audio signal with constant fundamental frequency by overlap-adding short-term signals extracted from this signal. The length of the weighting windows used to obtain the short-term signals is approximately equal to two times the period of the audio signal and their position within the period can be set to any value (provided the time shift between successive windows is equal to the period of the audio signal). Document U.S. Pat. No. 5,479,564 also describes a means of interpolating waveforms between segments to concatenate so as to smooth out discontinuities. Such PSOLA methods make it possible to modify the duration of a given speech signal. This is done by repeating or deleting pitch bells before an overlap and add operation is performed for the speech synthesis. The information in a pitch bell is not always suitable for repetition, as in a plosive sound. It is a common disadvantage of prior art PSOLA methods that artefacts are introduced this way. These artefacts can lead to a metallic sound of the synthesized speech signal and can even seriously affect or destroy the intelligibility of the synthesized signal.

The present invention therefore aims to provide an improved method for processing of a speech signal.

The present invention provides a method, a computer program product and a computer system for processing of a speech signal. In essence, the present invention makes it possible to synthesize a natural sounding synthesized speech signal with improved intelligibility.

This is accomplished by classifying certain intervals contained in the original speech signal. In accordance with a preferred embodiment of the invention 'steady' and 'dynamic' intervals are identified within the original speech signal. This classification needs to be performed only once. It is utilized for synthesizing a speech signal based on the original speech signal with a modified duration.
The present invention is based on the observation that the repetition of pitch bells from dynamic intervals, as it is done in prior art PSOLA methods, introduces an unintentional periodicity which leads to artefacts, such as a metallic sounding synthesized signal, and to reduced or destroyed intelligibility.

In accordance with the present invention this problem is solved by restricting the processing of pitch bells for the purpose of duration modification to pitch bells of steady intervals of the original speech signal. In other words, duration modifications are only performed on those speech intervals which can have different durations. This is true for the middle of a vowel or a consonant like the hi sound. But there are cases where local events occur that last less than a single period. These are sudden changes like the start of an unvoiced plosive (/p/, /t/, /k/) or the ticks and clicks produced by the tongue and the mouth (/b/, /d/, /g/, /q/, etc.). Periods containing these events are important for intelligibility and should not be omitted by manipulation. Repeating them is also a problem since this introduces artefacts that sound unnatural. Also the periods at the start of a transition from an unvoiced sound to a vowel have local features that should not be made longer or shorter. To avoid artefacts, all periods are marked with a special period class-type information. This information is used to determine whether a period can be repeated or omitted. Hence, pitch bells which are obtained by windowing of dynamic intervals of the original speech signal are not repeated for duration modification. Pitch bells which are obtained from intervals which are classified as dynamic and as being essential for the intelligibility are kept in the synthesized signal in order to maintain intelligibility. Pitch bells which are obtained by windowing of intervals of the original speech signal which are classified as dynamic but as not being essential for intelligibility may or may not be deleted before performing the overlap and add operation without seriously affecting the quality of the resulting synthesized speech signal.

A preferred application of the present invention is for text-to-speech systems which store a large number of natural speech recordings which are modified in the process of text-to-speech synthesis.

In accordance with a preferred embodiment of the invention a raised cosine window is used for the windowing of the speech signal. Preferably a sine window is used for steady intervals containing unvoiced speech. The pitch bells obtained for such steady intervals containing unvoiced speech are randomized in order to remove any unintended periodicity which can be introduced in the process of duration modification.
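The two window shapes and the randomization can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the windows are written in their periodic form, and shuffling the order of the pitch bells is only one possible reading of "randomized".

```python
import math
import random


def raised_cosine_window(n):
    """Raised cosine window: half-overlapped copies add up to 1 in amplitude."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]


def sine_window(n):
    """Sine window: half-overlapped copies add up to 1 in power."""
    return [math.sin(math.pi * i / n) for i in range(n)]


def randomize_bells(bells, rng):
    """Return the pitch bells of one steady unvoiced interval in random order.

    The input list is left untouched; rng is a random.Random instance so
    that the result is reproducible when a seed is supplied.
    """
    shuffled = list(bells)
    rng.shuffle(shuffled)
    return shuffled


noise_bells = ["n0", "n1", "n2", "n3"]  # placeholders for windowed noise periods
shuffled_bells = randomize_bells(noise_bells, random.Random(0))
```

A plausible reason for the sine window, not stated in the text, is that overlapping noise-like segments are largely uncorrelated, so their powers rather than their amplitudes add; the sine window keeps that summed power constant.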




PHNL020858EPP 




17.09.2002



In the following, preferred embodiments of the invention will be described in greater detail by making reference to the drawings in which:

Fig. 1 is illustrative of a flow chart of a preferred embodiment of the present invention.

Fig. 2 is illustrative of the synthesis of a speech signal based on an original speech signal in accordance with an embodiment of the present invention.

Fig. 3 is a block diagram of an embodiment of a computer system of the invention.



Fig. 1 shows a flow diagram to illustrate a preferred embodiment of a method of the invention. In step 100 a recording of natural speech is provided. In step 102 intervals in the natural speech recording are identified and classified. For the classification of the speech intervals the following classification system is used in the example considered here:

- silence
- unvoiced period
- voiced period
- crucial dynamic unvoiced period (should only be used once)
- crucial dynamic voiced period (should only be used once)
- dynamic unvoiced period (may only be used once)
- dynamic voiced period (may only be used once)

The two basic categories of speech intervals are 'steady' and 'dynamic' speech intervals. A speech interval is classified as 'steady' when it has an essentially constant signal characteristic for a consecutive number of at least two periods of the fundamental frequency of the natural speech signal. An interval of the speech recording is classified as 'dynamic' when its signal characteristic only occurs within one period of the fundamental frequency.

In the classification system considered here the unvoiced and voiced ('v') periods are steady periods. The 'p', 'b', 'q' and 'c' periods are dynamic periods which are treated differently in the subsequent processing.
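The classification system can be written down as a small lookup table. The sketch below is an illustration only: the names `PeriodClass`, `PERIOD_CLASSES` and `BY_CODE` are hypothetical, the silence class is left out because its treatment is not described here, and the code of the unvoiced steady class is set to `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeriodClass:
    name: str
    code: Optional[str]  # letter code used in the text; None where none is assumed
    steady: bool         # True for steady periods, False for dynamic periods
    crucial: bool        # True if the period is essential for intelligibility

    @property
    def may_repeat(self) -> bool:
        # Only steady periods may be repeated to lengthen the signal.
        return self.steady

    @property
    def may_delete(self) -> bool:
        # Crucial dynamic periods must be kept; everything else may be dropped.
        return not self.crucial


PERIOD_CLASSES = (
    PeriodClass("unvoiced period", None, steady=True, crucial=False),
    PeriodClass("voiced period", "v", steady=True, crucial=False),
    PeriodClass("crucial dynamic unvoiced period", "p", steady=False, crucial=True),
    PeriodClass("crucial dynamic voiced period", "b", steady=False, crucial=True),
    PeriodClass("dynamic unvoiced period", "q", steady=False, crucial=False),
    PeriodClass("dynamic voiced period", "c", steady=False, crucial=False),
)

# Lookup by letter code for the classes whose code is known.
BY_CODE = {c.code: c for c in PERIOD_CLASSES if c.code is not None}
```

Keeping the two flags separate makes the three possible treatments explicit: repeat or delete (steady), keep exactly once (crucial dynamic), and keep once or drop (other dynamic).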

In step 104 the natural speech signal is windowed to obtain pitch bells. Preferably the windowing is performed by means of a raised cosine window or with a sine window for the unvoiced steady periods.
In step 106 the pitch bells which are obtained for periods which are classified as 'steady' are processed in order to modify the duration of the speech signal. This can be done by repeating or deleting of pitch bells to increase or decrease the original duration, respectively. Pitch bells which are obtained from periods which are classified as 'dynamic' are not repeated in order to avoid the introduction of artefacts. Pitch bells which have been obtained from periods which are classified as 'p' or 'b' cannot be deleted, in order to maintain the intelligibility of the original signal. Pitch bells which are obtained for periods which are classified as 'q' or 'c' are also not repeated, but can be deleted without seriously affecting the intelligibility of the resulting synthesized signal.

Preferably pitch bells for periods which are classified as unvoiced steady periods are obtained in a randomized way in order to avoid the introduction of periodicity. This is further helped by the usage of a sine window for the windowing of those periods.

In step 108 the processed pitch bells are overlapped and added in order to obtain the synthesized signal.
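The selection of pitch bells in step 106 can be sketched as a single pass over the classified periods. The function name `modify_duration` and its parameters are hypothetical, and treating every code other than 'p', 'b', 'q' and 'c' as steady is an assumption made for this illustration.

```python
ESSENTIAL = {"p", "b"}  # dynamic periods that must appear exactly once
OPTIONAL = {"q", "c"}   # dynamic periods that appear at most once


def modify_duration(periods, repeat, drop_optional=False):
    """Select pitch bells for the overlap and add operation.

    periods: list of (code, bell) pairs in time order.
    repeat: number of copies emitted for each steady bell (0 deletes it).
    drop_optional: if True, bells of 'q' and 'c' periods are left out.
    Any code outside ESSENTIAL and OPTIONAL is treated as steady (assumption).
    """
    out = []
    for code, bell in periods:
        if code in ESSENTIAL:
            out.append(bell)
        elif code in OPTIONAL:
            if not drop_optional:
                out.append(bell)
        else:
            out.extend([bell] * repeat)
    return out


# Placeholder bells; in practice each bell would be a windowed signal segment.
periods = [("b", "B0"), ("c", "C0"), ("v", "V0"), ("v", "V1"), ("p", "P0")]
stretched = modify_duration(periods, repeat=2)
trimmed = modify_duration(periods, repeat=2, drop_optional=True)
shortened = modify_duration(periods, repeat=0)
```

The returned list is what would be handed to the overlap and add operation of step 108.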

Fig. 2 is illustrative of an example for the processing of a natural speech signal 200. The natural speech signal 200 has dynamic intervals 202, 204, 206, 208, 210 and 212. The dynamic interval 202 contains periods which are classified as 'b' and 'c'. The dynamic intervals 204, 206, 208, 210 and 212 likewise contain periods which are classified with dynamic period classes. Further the natural speech signal 200 contains steady intervals 214, 216, 218, 220, 222 and 224. The steady interval 214 contains periods which are classified as 'v'; the steady intervals 216 and 218 contain periods which are classified as unvoiced; the steady intervals 220, 222 and 224 contain periods which are classified as 'v'. This classification can be performed either manually or automatically by means of an appropriate signal analysis program. Preferably an automatic analysis is performed by means of such a program which is then controlled by a human expert and manually corrected, if necessary. It is to be noted that this classification needs to be performed only once in order to enable an unlimited number of signal syntheses.

In the example considered here a signal is to be synthesized based on the natural speech signal 200 which has an extended duration, as compared to the original speech signal 200. For this purpose the natural speech signal 200 is windowed by means of a window positioned synchronously with the fundamental frequency of the natural speech signal 200, as it is as such known from the prior art and used in PSOLA type methods.

Preferably a raised cosine is used as window. For periods which are classified as unvoiced steady periods a sine window is used in order to reduce unintended periodicity which may be introduced when pitch bells of the noisy signal portion are repeated. As a further measure against unintended periodicity the pitch bells for the unvoiced steady periods are acquired in a randomized way. In the example considered here the signal to be synthesized is composed as follows in the domain of the time axis 226:
The first interval 228 of the speech signal to be synthesized contains the pitch bells from the dynamic interval 202. These pitch bells are used for the interval 228 without modification, which implies that the duration of the interval 228 is unchanged with respect to the dynamic interval 202. The duration of the interval 230 is about twice the duration of the corresponding steady interval 214. This is accomplished by repeating each of the pitch bells acquired for the steady interval 214. Interval 232 contains the pitch bells from the dynamic interval 204. The duration of 232 is unchanged as compared to the dynamic interval 204. Interval 234 is constituted by pitch bells acquired from steady interval 216. Again each of the pitch bells contained in the steady interval 216 is repeated in order to double the duration of this interval. Likewise the following intervals 236, 238, 240, 242, ... are obtained from the intervals 206, 218, 208, 220, 210, 222, 212, 224. Next the pitch bells are overlapped in the domain of the time axis 226 in order to obtain the resulting synthesized signal. Alternatively the pitch bells obtained from the periods of the natural speech signal 200 which are classified as 'q' or 'c' can be deleted. In any case none of the pitch bells which are obtained from periods of the natural speech signal 200 which are classified as 'dynamic' are repeated. This way a duration modification can be performed without introducing artefacts which would otherwise seriously impact the quality of the synthesized signal.
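Because dynamic intervals keep their duration, doubling every steady interval stretches the whole signal by less than a factor of two. The relation below follows from simple arithmetic and is not stated in the text; the function name and parameters are hypothetical.

```python
def steady_repeat_factor(dynamic_duration, steady_duration, target_duration):
    """Factor by which the steady part must be stretched to reach a target.

    All three durations use the same unit (for example milliseconds).
    The dynamic part is assumed to be copied unchanged, so the steady
    part has to absorb the whole change in duration.
    """
    if steady_duration <= 0:
        raise ValueError("no steady material available for stretching")
    factor = (target_duration - dynamic_duration) / steady_duration
    if factor < 0:
        raise ValueError("target is shorter than the dynamic part alone")
    return factor


# 200 ms of dynamic and 800 ms of steady material, target 1800 ms in total.
factor = steady_repeat_factor(200, 800, 1800)

# Doubling the steady part of the same signal gives 200 + 2 * 800 ms.
total_after_doubling = 200 + 2 * 800
```

In this toy case the overall duration grows by a factor of 1.8 although every steady interval is doubled. Deleting optional dynamic periods reduces the dynamic duration and shifts the balance further.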



In the example considered here 'p' is used to mark local (unvoiced) events that are crucial for the intelligibility of the spoken utterance. Usually, the noise burst after the release of air by the mouth or the tongue is of this type. The phonemes /p/, /t/ and /k/ have at least one such period. Periods marked with 'p' should appear only once in the synthesized speech, regardless of the final duration of the phoneme. Some local (unvoiced) events are not crucial for intelligibility but are so dynamic that repeating them would introduce a series of unnatural sounding periods. These periods are marked with the letter 'q'. They may only be used once, but they can also be omitted without a major degradation in quality or intelligibility. The voiced counterparts for 'p' and 'q' are the types denoted by 'b' and 'c'. The voiced plosives /b/, /d/ and /g/ usually have at least one period marked with 'b'. Also the tongue can produce tick and click sounds when it hits or leaves other parts of the mouth. The phoneme /l/ is an example where this can happen. The transitions from silence to vowels or from unvoiced consonants to vowels also have periods with local events. Although the periods in the middle of a vowel can be repeated many times without affecting the naturalness, the periods that fall right in the middle of the transition are too dynamic for repetition.

Fig. 3 shows a block diagram of an embodiment of a computer system of the invention. Preferably the computer system is a text-to-speech system which embodies the principles of the present invention. The computer system 300 has a module 302 which serves to store natural speech signals. Module 304 serves to automatically, manually or interactively classify periods of the natural speech signals stored in the module 302. Module 306 serves to perform the windowing of a natural speech signal stored in the module 302. This way a number of pitch bells are obtained. Module 308 serves for pitch bell processing. The pitch bell processing for duration modification is only performed on pitch bells which are obtained from intervals which are classified as steady. In addition pitch bells from dynamic intervals which are classified as not being essential for the intelligibility can be deleted by module 308, such that they do not occur in the synthesized signal. Module 310 serves to perform an overlap and add operation of the resulting pitch bells in order to obtain the synthesized signal. The desired modification of the duration of the original natural speech signal stored in module 302 is inputted into the computer system 300. The resulting synthesized signal is outputted from the computer system 300 on a carrier wave or as a data file.
LIST OF REFERENCE NUMERALS:

200 natural speech signal
202 dynamic interval
204 dynamic interval
206 dynamic interval
208 dynamic interval
210 dynamic interval
212 dynamic interval
214 steady interval
216 steady interval
218 steady interval
220 steady interval
222 steady interval
224 steady interval
226 time axis
228 interval
230 interval
232 interval
234 interval
236 interval
238 interval
240 interval
242 interval
300 computer system
302 module
304 module
306 module
308 module
310 module
1. A method of synthesizing of a speech signal, comprising:

assigning of a first identifier to a first class of intervals of an original speech signal and assigning of a second identifier to a second class of intervals of the original speech signal,

windowing the original speech signal to provide a number of pitch bells,

processing the pitch bells having the first identifier assigned thereto for modifying a duration of the speech signal,

performing an overlap and add operation on the processed pitch bells.

2. The method of claim 1, the first class of intervals being steady intervals.

3. The method of claim 1 or 2, a first code or a second code being used as the first identifier, the first code being indicative of an unvoiced interval and the second code being indicative of a voiced interval.

4. The method of claim 1, 2 or 3, the second class of intervals being dynamic intervals.

5. The method of any one of the preceding claims 1 to 4, whereby a third code, a fourth code, a fifth code or a sixth code is used as the second identifier, the third code being indicative of an unvoiced interval being essential for the intelligibility of the speech signal, the fourth code being indicative of a voiced interval being essential for the intelligibility of the speech signal, the fifth code being indicative of an unvoiced interval not being essential for the intelligibility of the speech signal and the sixth code being indicative of a voiced interval not being essential for the intelligibility of the speech signal.

6. The method of claim 5, whereby pitch bells being assigned to the fifth or sixth code are deleted optionally.
7. The method of any one of the preceding claims 1 to 6, whereby a raised cosine is used for windowing of the speech signal.

8. The method of any one of the preceding claims 1 to 7, a sine window being used for windowing of steady, unvoiced intervals of the speech signal.

9. The method of any one of the preceding claims 1 to 7, further comprising randomizing the pitch bells of steady, unvoiced periods before performing the overlap and add operation.

10. The method of any one of the preceding claims 1 to 9, whereby the windowing is performed by means of a window positioned synchronously with a fundamental frequency of the speech signal.

11. Computer program product, such as a digital storage medium, the computer program product comprising program means for performing the following processing steps for the modification of a duration of an original speech signal:

assigning of a first identifier to a first class of intervals of an original speech signal and assigning of a second identifier to a second class of intervals of the original speech signal,

windowing the original speech signal to provide a number of pitch bells,

processing the pitch bells having the first identifier assigned thereto for modifying a duration of the speech signal,

performing an overlap and add operation on the processed pitch bells.

12. Computer system comprising:

means (302) for storing of a speech signal,

means (304) for storing of first identifiers being assigned to a first class of intervals of an original speech signal and for storing of second identifiers being assigned to a second class of intervals of the original speech signal,

means (306) for windowing the speech signal to provide a number of pitch bells,

means (308) for processing the pitch bells having the first identifier assigned thereto for modifying a duration of the speech signal,

means (310) for performing an overlap and add operation on the processed pitch bells.



13. A synthesized speech signal being composed of pitch bells, which are overlapped and added, whereby only pitch bells of steady voiced or unvoiced intervals of an original speech signal have been processed in order to accomplish a duration modification of the original speech signal.

14. The speech signal of claim 13, whereby one or more pitch bells belonging to a dynamic voiced or unvoiced interval have been deleted prior to the overlap and add operation.
ABSTRACT:

The present invention relates to a method of synthesizing of a speech signal, comprising:

assigning of a first identifier to a first class of intervals of an original speech signal and assigning of a second identifier to a second class of intervals of the original speech signal,

windowing the original speech signal to provide a number of pitch bells,

processing the pitch bells having the first identifier assigned thereto for modifying a duration of the speech signal,

performing an overlap and add operation on the processed pitch bells.